wildcard expansion in vsearch bug fix #8307
gzip -n ${prefix}.${out_ext}* was throwing an "arguments too long" error. I've fixed it by using find to get all the needed files and pipe them to xargs gzip. It was still acting oddly on my computer (some files ended up with two, three, or four ".gz"s at the end - so it must have been passing them to gzip multiple times). That was easily fixed by adding the regex for a single digit ("[0-9]") at the end of the pattern.
wildcard expansion throwing an "arguments too long" error
@@ -60,7 +60,7 @@ process VSEARCH_CLUSTER {

     if [[ $args3 == "--clusters" ]]
     then
-        gzip -n ${prefix}.${out_ext}*
+        find . -name \"${prefix}.${out_ext}*[0-9]\" | xargs gzip -n
I would be consistent throughout: if you've had this issue with --clusters, others might have the same issue with --samout. Also update that one, no?
Also, is it a given that vsearch always appends a single digit to the end of the file name?
It might also be good to tell find that we're looking for regular files with -type f.
I added in the -type f, good call on that.
The --samout bit doesn't use a wildcard expansion like the --clusters one does, so I think it should be OK as is. vsearch outputs many single-entry fastas, one for each centroid, whereas I believe that samtools makes a single multi-entry fasta so no wildcard expansion is needed. Happy to discuss further if I've missed something.
vsearch does append digits to the end when the --clusters flag is set, starting with 0 and counting up, with one file for each cluster centroid:
ASV_post_clustering.clusters.fasta0
ASV_post_clustering.clusters.fasta10000
ASV_post_clustering.clusters.fasta10001
ASV_post_clustering.clusters.fasta10002
ASV_post_clustering.clusters.fasta10003
...
So anchoring the pattern with a final trailing digit does match all of those. gzip then appends .gz, so the files end up looking like:
ASV_post_clustering.clusters.fasta0.gz
ASV_post_clustering.clusters.fasta10000.gz
ASV_post_clustering.clusters.fasta10001.gz
ASV_post_clustering.clusters.fasta10002.gz
ASV_post_clustering.clusters.fasta10003.gz
...
which will not be matched again as now they no longer end with a digit.
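To make the re-matching argument concrete, here is a minimal sketch that can be run outside the pipeline. The names demo and clusters.fasta are hypothetical stand-ins for ${prefix} and ${out_ext}, and the -print0, -0 and -r options are additions of this sketch rather than part of the PR (-r is a GNU xargs option that skips the command when there is no input).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the module's ${prefix} and ${out_ext}.
prefix="demo"
out_ext="clusters.fasta"

# Work in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir"

# Mimic vsearch --clusters output: one file per cluster, numbered suffix.
for i in 0 1 2 10 11; do
    printf '>seq%s\nACGT\n' "$i" > "${prefix}.${out_ext}${i}"
done

compress_clusters() {
    # Only names ending in a digit match, so files that already end in .gz
    # are never selected again.
    find . -type f -name "${prefix}.${out_ext}*[0-9]" -print0 \
        | xargs -0 -r gzip -n
}

compress_clusters
compress_clusters   # second pass: nothing matches, so gzip is not run

ls
```

Running the function twice is only there to show that a repeated pass leaves the already compressed files alone.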
Ok great, then it's good to go I believe!
thank you!
nice safety mechanism.
A wildcard expansion is throwing an "arguments too long" error
gzip -n ${prefix}.${out_ext}* was throwing an "arguments too long" error: my dataset is quite large and there are over 50k files to be gzipped at this step. I've fixed it by using find to get all the needed files and piping them to xargs gzip.
It was still acting oddly on the HPC: some files ended up with two, three, or four ".gz"s at the end, so find must have been passing them to xargs gzip multiple times. Seems odd. Anyway, that was easily fixed by adding a single-digit character class ([0-9]) at the end of the pattern.
PR checklist
Closes #8305
no test data added, but I've run it on my dataset and it works. Happy to discuss if needed.
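For context on why the find and xargs route avoids the error: a shell glob expands into a single command line, which has to fit within the system's argument-size limit, whereas xargs splits its input into as many invocations as needed. The sketch below only illustrates that batching. The names f0 to f4 are made up, echo stands in for actually running gzip, and -n 2 forces tiny batches purely to make the splitting visible (by default xargs sizes batches to fit the limit).

```shell
#!/usr/bin/env bash
set -euo pipefail

# System limit on the combined size of arguments and environment for one exec.
getconf ARG_MAX

# xargs turns one long list of names into several shorter commands.
# echo prints the commands that would be run instead of running gzip.
batches="$(printf '%s\n' f0 f1 f2 f3 f4 | xargs -n 2 echo gzip -n)"
printf '%s\n' "$batches"
# prints:
#   gzip -n f0 f1
#   gzip -n f2 f3
#   gzip -n f4
```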